Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging
In this paper we show that reporting a single performance score is
insufficient to compare non-deterministic approaches. We demonstrate for common
sequence tagging tasks that the seed value for the random number generator can
result in statistically significant (p < 10^-4) differences for
state-of-the-art systems. For two recent systems for NER, we observe an
absolute difference of one percentage point F1-score depending on the selected
seed value, making these systems perceived either as state-of-the-art or
mediocre. Instead of publishing and reporting single performance scores, we
propose to compare score distributions based on multiple executions. Based on
the evaluation of 50,000 LSTM-networks for five sequence tagging tasks, we
present network architectures that both produce superior performance and
are more stable with respect to the remaining hyperparameters. Comment: Accepted at EMNLP 201
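The idea of comparing score distributions rather than single scores can be sketched as follows. This is only an illustration under stated assumptions: the helper names `summarize` and `prob_superior` are hypothetical, the F1 values are made up, and the abstract's own statistical procedure is not reproduced here.

```python
from itertools import product
from statistics import mean, stdev


def summarize(scores):
    """Return basic statistics of scores from repeated runs (one score per seed)."""
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


def prob_superior(a, b):
    """Estimate P(score from a > score from b) over all run pairs; ties count half."""
    wins = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x, y in product(a, b)
    )
    return wins / (len(a) * len(b))


# Made-up F1 scores for two hypothetical systems, five seeds each (illustration only).
system_a = [90.1, 90.8, 91.0, 90.4, 90.6]
system_b = [90.3, 90.9, 90.2, 90.7, 90.5]

print(summarize(system_a))
print(summarize(system_b))
print(prob_superior(system_a, system_b))
```

Reporting the spread (standard deviation, minimum, maximum) next to the mean makes visible how much of an apparent difference between two systems could come from the seed alone.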
Alternative Weighting Schemes for ELMo Embeddings
ELMo embeddings (Peters et al., 2018) had a huge impact on the NLP community,
and many recent publications use these embeddings to boost the performance for
downstream NLP tasks. However, integration of ELMo embeddings in existing NLP
architectures is not straightforward. In contrast to traditional word
embeddings, like GloVe or word2vec embeddings, the bi-directional language
model of ELMo produces three 1024-dimensional vectors per token in a sentence.
Peters et al. proposed to learn a task-specific weighting of these three
vectors for downstream tasks. However, this proposed weighting scheme is not
feasible for certain tasks, and, as we will show, it does not necessarily yield
optimal performance. We evaluate different methods that combine the three
vectors from the language model in order to achieve the best possible
performance in downstream NLP tasks. We notice that the third layer of the
published language model often decreases the performance. By learning a
weighted average of only the first two layers, we are able to improve the
performance for many datasets. Due to the reduced complexity of the language
model, we have a training speed-up of 19-44% for the downstream task.
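A weighted average over the layer vectors of a token can be sketched as below. This is a minimal sketch, assuming softmax-normalized scalar weights and a scaling factor `gamma`; the function names are hypothetical, the toy vectors have 2 dimensions instead of 1024, and the weights are fixed here rather than learned with the downstream task.

```python
import math


def softmax(raw):
    """Normalize raw scalar weights so they are positive and sum to one."""
    m = max(raw)
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_average(layers, raw_weights, gamma=1.0):
    """Combine per-layer vectors of one token into a single vector."""
    if len(layers) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    weights = softmax(raw_weights)
    dim = len(layers[0])
    return [
        gamma * sum(w * layer[i] for w, layer in zip(weights, layers))
        for i in range(dim)
    ]


# Toy stand-ins for the three layer outputs of one token (illustration only).
token_layers = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]

# Use only the first two layers, as the abstract suggests, with equal raw weights.
print(weighted_average(token_layers[:2], [0.0, 0.0]))
```

Dropping the third layer simply means passing `token_layers[:2]` with two weights instead of three.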
The precision of line position measurements of unresolved quasar absorption lines and its influence on the search for variations of fundamental constants
Optical quasar spectra can be used to trace variations of the fine-structure
constant alpha. Controversial results published in recent years
suggest that, in addition to wavelength calibration problems, systematic
errors might arise because of insufficient spectral resolution. The aim of this
work is to estimate the impact of incorrect line decompositions in fitting
procedures due to asymmetric line profiles. Methods are developed to
distinguish between different sources of line position shifts and thus to
minimize error sources in future work. To simulate asymmetric line profiles,
two different methods were used. First, the profile was created as an
unresolved blend of narrow lines; second, the profile was created using a
macroscopic velocity field of the absorbing medium. The simulated spectra were
analysed with standard methods to search for apparent shifts of line positions
that would mimic a variation of fundamental constants. Differences between
position shifts due to an incorrect line decomposition and a real variation of
constants were probed using methods that have been newly developed or adapted
for this kind of analysis. The results were then applied to real data. Apparent
relative velocity shifts of several hundred meters per second are found in the
analysis of simulated spectra with asymmetric line profiles. It was found that
each system has to be analysed in detail to distinguish between different
sources of line position shifts. A set of 16 FeII systems in seven quasar
spectra was analysed. With the methods developed, the mean alpha variation that
appeared in these systems was reduced from the original
Dalpha/alpha=(2.1+/-2.0)x10^-5 to Dalpha/alpha=(0.1+/-0.8)x10^-5. We thus
conclude that incorrect line decompositions can be partly responsible for the
conflicting results published so far.
Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks
Selecting optimal parameters for a neural network architecture can often make
the difference between mediocre and state-of-the-art performance. However,
little has been published on which parameters and design choices should be
evaluated or selected, often making correct hyperparameter optimization a "black art that
requires expert experiences" (Snoek et al., 2012). In this paper, we evaluate
the importance of different network design choices and hyperparameters for five
common linguistic sequence tagging tasks (POS, Chunking, NER, Entity
Recognition, and Event Detection). We evaluated over 50,000 different setups
and found that some parameters, like the pre-trained word embeddings or the
last layer of the network, have a large impact on the performance, while other
parameters, for example the number of LSTM layers or the number of recurrent
units, are of minor importance. We give a recommendation on a configuration
that performs well across different tasks. Comment: 34 pages. 9-page version of this paper published at EMNLP 201
Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches
Developing state-of-the-art approaches for specific tasks is a major driving
force in our research community. Depending on the prestige of the task,
publishing such an approach can bring a lot of visibility. The question arises:
how reliable are our evaluation methodologies for comparing approaches?
One common methodology to identify the state-of-the-art is to partition data
into a train, a development and a test set. Researchers can train and tune
their approach on some part of the dataset and then select the model that
worked best on the development set for a final evaluation on unseen test data.
Test scores from different approaches are compared, and performance differences
are tested for statistical significance.
In this publication, we show that there is a high risk that a statistical
significance in this type of evaluation is not due to a superior learning
approach. Instead, there is a high risk that the difference is due to chance.
For example, for the CoNLL 2003 NER dataset, we observed type I errors (false
positives) in up to 26% of the cases at a threshold of p < 0.05, i.e.,
falsely concluding a statistically significant difference between two identical
approaches.
We prove that this evaluation setup is unsuitable to compare learning
approaches. We formalize alternative evaluation setups based on score
distributions.
The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes
Information Retrieval using dense low-dimensional representations recently
became popular and has outperformed traditional sparse representations
like BM25. However, no previous work investigated how dense representations
perform with large index sizes. We show theoretically and empirically that the
performance for dense representations decreases more quickly than for sparse
representations for increasing index sizes. In extreme cases, this can even
lead to a tipping point where at a certain index size sparse representations
outperform dense representations. We show that this behavior is tightly
connected to the number of dimensions of the representations: The lower the
dimension, the higher the chance for false positives, i.e. returning irrelevant
documents. Comment: Published at ACL 202
Generalizing Cross-Document Event Coreference Resolution Across Multiple Corpora
Cross-document event coreference resolution (CDCR) is an NLP task in which
mentions of events need to be identified and clustered throughout a collection
of documents. CDCR aims to benefit downstream multi-document applications, but
despite recent progress on corpora and system development, downstream
improvements from applying CDCR have not been shown yet. We make the
observation that every CDCR system to date was developed, trained, and tested
only on a single respective corpus. This raises strong concerns about their
generalizability -- a must-have for downstream applications where the magnitude
of domains or event mentions is likely to exceed those found in a curated
corpus. To investigate this assumption, we define a uniform evaluation setup
involving three CDCR corpora: ECB+, the Gun Violence Corpus and the Football
Coreference Corpus (which we reannotate on token level to make our analysis
possible). We compare a corpus-independent, feature-based system against a
recent neural system developed for ECB+. Whilst being inferior in absolute
numbers, the feature-based system shows more consistent performance across all
corpora whereas the neural system is hit-and-miss. Via model introspection, we
find that the importance of event actions, event time, etc. for resolving
coreference in practice varies greatly between the corpora. Additional analysis
shows that several systems overfit on the structure of the ECB+ corpus. We
conclude with recommendations on how to achieve generally applicable CDCR
systems in the future -- the most important being that evaluation on multiple
CDCR corpora is strongly necessary. To facilitate future research, we release
our dataset, annotation guidelines, and system implementation to the public. Comment: Accepted at CL Journal
Key Recovery Attack on QuiSci
This paper shows a key recovery attack on QuiSci (quick stream cipher), designed by Stefan Müller (FGAN-FHR, a German research institute) in 2001. With one or a few known plaintexts, it's possible to recover most of the key with negligible time complexity.
This paper shows how to exploit the weak key setup of QuiSci.
Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks
There are two approaches for pairwise sentence scoring: Cross-encoders, which
perform full-attention over the input pair, and Bi-encoders, which map each
input independently to a dense vector space. While cross-encoders often achieve
higher performance, they are too slow for many practical use cases.
Bi-encoders, on the other hand, require substantial training data and
fine-tuning over the target task to achieve competitive performance. We present
a simple yet efficient data augmentation strategy called Augmented SBERT, where
we use the cross-encoder to label a larger set of input pairs to augment the
training data for the bi-encoder. We show that, in this process, selecting the
sentence pairs is non-trivial and crucial for the success of the method. We
evaluate our approach on multiple tasks (in-domain) as well as on a domain
adaptation task. Augmented SBERT achieves an improvement of up to 6 points for
in-domain and of up to 37 points for domain adaptation tasks compared to the
original bi-encoder performance. Comment: Accepted at NAACL 202
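The augmentation step described in this abstract, labeling additional sentence pairs with a stronger scorer and adding them to the training data, can be sketched as follows. This is a toy illustration under loud assumptions: `token_overlap` is a Jaccard token-overlap stand-in and NOT a real cross-encoder, `augment` is a hypothetical helper, the sentences are made up, and the pair-selection strategy that the abstract calls crucial is not implemented.

```python
def token_overlap(a, b):
    """Stand-in scorer (Jaccard token overlap); NOT a real cross-encoder."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def augment(gold, unlabeled_pairs, score_fn):
    """Label unlabeled pairs with score_fn and append them to the gold data."""
    silver = [(a, b, score_fn(a, b)) for a, b in unlabeled_pairs]
    return gold + silver


# Made-up gold data and unlabeled pairs (illustration only).
gold = [("a cat sits", "a cat sits", 1.0)]
unlabeled = [("a dog runs", "a dog sleeps"), ("red car", "blue sky")]

# The combined list would serve as training data for the bi-encoder.
train = augment(gold, unlabeled, token_overlap)
for example in train:
    print(example)
```

In a real setup, `score_fn` would be a trained cross-encoder and the unlabeled pairs would come from a sampling strategy rather than a fixed list.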
GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval
Dense retrieval approaches can overcome the lexical gap and lead to
significantly improved search results. However, they require large amounts of
training data which is not available for most domains. As shown in previous
work (Thakur et al., 2021b), the performance of dense retrievers severely
degrades under a domain shift. This limits the usage of dense retrieval
approaches to only a few domains with large training datasets.
In this paper, we propose the novel unsupervised domain adaptation method
Generative Pseudo Labeling (GPL), which combines a query generator with pseudo
labeling from a cross-encoder. On six representative domain-specialized
datasets, we find the proposed GPL can outperform an out-of-the-box
state-of-the-art dense retrieval approach by up to 9.3 points nDCG@10. GPL
requires less (unlabeled) data from the target domain and is more robust in its
training than previous methods.
We further investigate the role of six recent pre-training methods in the
scenario of domain adaptation for retrieval tasks, where only three could yield
improved results. The best approach, TSDAE (Wang et al., 2021), can be combined
with GPL, yielding another average improvement of 1.4 points nDCG@10 across the
six tasks. The code and the models are available at
https://github.com/UKPLab/gpl. Comment: Accepted at NAACL 202